Developing an Implicit Solvation Machine Learning Model for Molecular Simulations of Ionic Media

Molecular dynamics (MD) simulations of biophysical systems require accurate modeling of their native environment, i.e., aqueous ionic solution, as it critically impacts the structure and function of biomolecules. On the other hand, the models should be computationally efficient to enable simulations of large spatiotemporal scales. Here, we present the deep implicit solvation model for sodium chloride solutions that satisfies both requirements. Owing to the use of the neural network potential, the model can capture the many-body potential of mean force, while the implicit water treatment renders the model inexpensive. We demonstrate our approach first for pure ionic solutions with concentrations ranging from physiological to 2 M. We then extend the model to capture the effective ion interactions in the vicinity and far away from a DNA molecule. In both cases, the structural properties are in good agreement with all-atom MD, showcasing a general methodology for the efficient and accurate modeling of ionic media.


INTRODUCTION
−15 The all-atom MD explicitly models all atoms in the simulated system.However, since the computational cost scales with the number of atoms, the method is often inapplicable to biologically relevant time scales and system sizes.Approaches like enhanced sampling techniques 16−19 can overcome this limitation to some extent.
Alternatively, to reduce the computational complexity, coarse-grained (CG) modeling 20−25 can be employed.It involves reducing the simulated degrees of freedom either by merging groups of correlated atoms into effective interaction sites or by treating part of the system, typically solvent, implicitly.The latter case can drastically reduce the number of explicit particle−particle interactions and the resulting computational costs since the solvent can comprise more than 90% of the simulated system.The solvent, treated as a dielectric continuum, can be considered with methods such as the Poisson−Boltzmann (PB), 26,27 COSMO/polarized continuum model, 28 the Generalized Born model, 29−31 or calculating the effective potential between solutes. 32However, the accuracy of these methods is often inadequate, e.g., some cannot maintain stable nucleic acid structures, or they introduce structural bias in proteins. 33,34evertheless, with multiscale simulations, one can employ CG models in the bulk region, while more accurate all-atom models can be used in the vicinity of biomolecules.Using the adaptive resolution simulation scheme, 35−38 the solvent, i.e., water molecules and ions, can change the resolution on-the-fly from all-atom to CG 39 or implicit hydration 40 and vice versa.This approach was, for example, employed to efficiently simulate a DNA molecule 38,40 and to study the mechanism governing the phase transitions of the high-density DNA arrays. 41,42achine learning (ML) paves the way for a new possibility in the past decade. 43The pioneering work of Behler showcased that deep neural networks (NN) could be employed to learn a computationally cheaper surrogate model for the density functional theory potential energy surface of bulk silicon. 44ollowing the initial studies, 45 other ML algorithms and NN architectures rapidly emerged. 46−55 ML algorithms can be utilized to construct not only all-atom models but also CG models. 56,57Previous studies demonstrated a successful reproduction of structural 58−64 and dynamical 65−67 properties.However, most considered pure water solvent or implicitly treated the ions, even though they play a vital role in biological processes.For example, the ionic atmosphere has a crucial effect on the secondary and tertiary structure's stability, the binding of charged drugs and proteins, and nucleic acid folding. 68−75 Several approaches were developed to fit the potential of mean force, which corresponds to a high-dimensional energy function.These methods are the iterative Boltzmann inversion (IBI), 76,77 the force matching, 78−80 and the relative entropy. 81hese methods have been widely used to determine the nonbonded interaction for aqueous salt solutions 82,83 and soft matter. 77The resulting potentials are pairwise.However, for the force-matched potential, it has been shown that including 3-body nonbonded interaction has an impact on structural and thermodynamic properties. 84For the calculation of a CG protein force field, the importance of higher-order terms was demonstrated to reach accuracy close to the all-atom model. 85,86As stated before, ML potentials are designed to learn the many-body atomic interaction. 87,88They are, by design, more adequate to approximate the many-body terms of the potential of the mean force.
Another crucial point regarding the simulation of ions and highly charged molecules is the electrostatic interactions. 89For example, without ions in solution, the effective DNA−DNA interactions are purely repulsive. 90This implies an accurate description of long-range interactions.Most of the ML potentials learn the local geometry of atoms around a central one until a defined cutoff distance.The potentials are thus "short-ranged" and neglect long-range physical effects.For isotropic systems such as water solutions, for example, the ML potential without an explicit electrostatic term already gives excellent results. 91On the contrary, for battery materials, it was shown that a long-range electrostatic treatment is needed when the system is anisotropic. 92Anisotropic environments for solutes are also observed in biological systems and in other interfacial systems.To address this major drawback, different ML architectures explicitly describe the electrostatic energy such as charge equilibration, 93 the long-distance equivariant framework LODE, 43,94 and the fourth-generation potential. 53,95−98 This work presents a deep implicit solvation (DIS) model for sodium chloride solutions, where water is coarse-grained out, whereas ions are modeled explicitly.The ML potential is based on an equivariant neural network (ENN) architecture due to its impressive data efficiency and the ability to generalize more accurately to out-of-distribution configurations. 51,52,54,55,99,100As stated by the Allegro's developers "the strict locality of the Allegro model naturally facilitates separation of the energy into a short-range term and a physically motivated long-range term."Thus, rather than directly fitting the potential of mean force, we define a prior potential composed of the Lennard-Jones and electrostatic interactions.The ML potential is trained to account for the difference between the all-atom data and the prior potential, an approach also known as delta learning. 101Our model can, therefore, account for the long-range electrostatic interactions, which are crucial for the accurate treatment of ions and highly charged molecules such as DNA.While some previous ML potentials included long-range electrostatics, 53,102,103 these models were computationally much more expensive.First, we validate the sodium chloride aqueous solution model at different salt concentrations ranging from 0.15 to 2.0 mol L −1 .After showing excellent performance for structural properties, we introduce a periodic DNA molecule into the system and demonstrate that our DIS model can accurately describe the effective ion interactions proximal and distal to the DNA biomolecule.

Database Generation.
We performed classical allatom simulations to obtain a database for the DIS model training and validation.These simulations were also used to compute the reference all-atom properties.We employ the TIP3P water model 104 and AMBER 03 force field 105 for the DNA molecule.The sequence of the periodic DNA molecule is CTCTCGAGAG.The Joung and Cheatham parameters 106 are used for the ions with additional corrections to ion−phosphate interactions. 107The nonbonded interactions are calculated within cutoff distances of 0.9 and 1.2 nm, respectively, for the LJ and the electrostatic potential.The electrostatic interactions beyond the cutoff are corrected with the PPPM solver. 108Since the aim of this study was focused on obtaining an accurate many-body potential of mean force for the ions, the atoms of the DNA molecule were frozen.A flexible DNA pitch could also be considered.All simulations are equilibrated in the NPT ensemble for 2 ns, followed by equilibrations in the NVT ensemble of 10 and 15 ns for 1.0 and 0.5 mol L −1 , respectively.During the production runs, carried out in the NVT ensemble, the forces applied on ions are saved every 1 ps.For pure salt solution systems, the number of configurations in the data set varies from 8 × 10 4 for 2.0 mol L −1 salt concentration to 3 × 10 5 for 0.15 mol L −1 salt concentration.For systems with the DNA molecule, the data set contains 1.8 × 10 5 configurations.For all cases, we randomly split the data set into training (80%) and validation (20%) data sets.

DIS Model.
In the DIS model, the ions and the DNA molecule are treated explicitly, while the water is coarsegrained out (Figure 1).The DNA model is coarse-grained because the full all-atom description would not be computationally efficient for the ML potential of ions.To find the optimal level of coarse-graining, we explored different DNA representations in the literature. 109,110The selected CG mapping is based on the DNA model developed by Kovaleva et al., 110 where each nucleotide is represented by six CG sites.The CG model is sufficiently complex to enable the ML potential to correctly fit the effective interactions in the vicinity of the DNA.
The many-body potential of the mean force for the ions is composed of two parts: an ML potential and a prior potential.The ML potential is an ENN Allegro. 51It uses a strictly local many-body equivariant representation, resulting in excellent computational efficiency.For further information, the Allegro's potential is described in Supporting Information and more in detail in developer's papers. 51,111The parameters of ML model are trained with the force-matching approach, 112 i.e., the training loss is defined as where F ijk is the force in the k-direction of the j-th ion in the ith configuration.We use the Adam optimizer 113 with the default parameters from pytorch. 114The details of the fit are reported in Supporting Information.
The prior potential prevents the exploration of physically invalid regions of the potential energy surface, such as particle overlaps. 64,115,116We find it to be especially important for small salt concentrations, i.e., low-density systems.Moreover, the ions, especially Na + cations, exhibit numerous stable configurations within both the minor and major grooves. 15hese configurations are already effectively captured by the prior model.The prior potential models the van der Waals and electrostatic interactions.For the former, we use the 12−6 Lennard-Jones potential with empirically optimized parameters (Supporting Information Table S1) to best fit the reference ion structural properties.For the latter, we employ the Coulomb interaction with the Wolf summation 117,118 for the long-range correction.The nonbonded interactions are also calculated within cutoff distances of 0.9 and 1.2 nm, respectively, for the LJ and the electrostatic potential.Since water is treated implicitly as a dielectric continuum, the electrostatic interactions are screened.We use a dielectric constant of 95 as this value corresponds to the measured dielectric constant of the TIP3P water model. 119Additionally, to further improve the fit of the sodium-phosphate interaction, we added oxygen atoms bound to the phosphate atoms (Figure 1 purple).The added oxygen atoms improve the fit of the Na-phosphate first coordination shell.These atoms are explicitly modeled only with the Lennard-Jones potential; i.e., they have a zero charge and are not seen by the ML potential.

COMPUTATIONAL DETAILS
All simulations have been carried out using LAMMPS. 120ewton's equations of motion are integrated using the Velocity Verlet integrator 121,122 with a 1 fs time step.Simulations are performed at a temperature of 300 K.For all-atom simulations, we employ the Nose−Hoover thermostat 123 with a coupling constant of 0.1 ps.In the case of the NPT simulations maintained at 1.0 bar, we additionally use the Nose−Hoover barostat with a coupling constant of 1.0 ps.For  simulations employing the DIS model, we use the Langevin thermostat with a coupling constant of 0.1 ps.All simulations are performed under periodic boundary conditions.The cubic box edge is around 4.2 nm for the most concentrated solutions and 5.4 nm for the lowest one.For simulations including the DNA molecule, the box size is 8.5 × 8.5 × 3.4 nm, which corresponds to exactly one DNA pitch oriented in the zdirection.The periodic boundary conditions are also used for the DNA molecule as in previous studies. 124,125Thus, the DNA molecule is effectively infinitely long.To analyze the properties of the trained DIS model, we perform 50 ns NVT simulations after 1 ns equilibration for the pure salt solution systems.For DNA simulations, we perform a 25 ns NVT simulation after 1 ns equilibration.

RESULTS AND DISCUSSION
4.1.Pure Ionic Aqueous Solutions.In the following, we compare our developed DIS model with the all-atom model, which serves as a target reference.Additionally, we show the results for the prior potential to highlight the inadequacy of simple classical potentials and the improvement made by the ML potential.
We first consider the pure sodium chloride solution at four salt concentrations, 0.15, 0.5, 1.0, and 2.0 mol L −1 .We investigate the structural properties by computing the radial distribution functions (RDFs) for the three ion−ion interactions, i.e., Na + −Na + , Na + −Cl − , and Cl − −Cl − (Figure 2).As expected, we observe for all pairs that as concentration increases, the height of the first RDF's peak increases with no shift in the position of the peak.The RDFs for the prior model exhibit ideal gas characteristics, i.e., a local structure highly dissimilar from the all-atom reference.With the addition of the ML potential, the effective ion−ion potentials are corrected and the RDFs are in agreement.The positions of the peaks are very well reproduced, while slight differences in the intensity of the peaks are observed for some RDFs.The differences are slightly higher at lower concentrations.The number of ions within the cutoff sphere of the ML potential is small at low concentrations.In particular, the coordination numbers, indicating how many ions can be found on average in a particular range, are shown in Supporting Information Figure S1.At the physiological salt concentration (0.15 mol L −1 ), the coordination numbers are smaller than one within the entire cutoff sphere.Consequently, these ML models are challenging to train.For the Na−Cl interaction, at each concentration, the solvent-separated pair exhibits two configurations at 4.6 and 5.0 Å.These are SSIP configurations already observed on RDF and confirmed by the calculation of the McMillan−Mayer potential. 126,127It is due to hydrogen bondings between the water molecules in the cation's first coordination shell and the anion; it is more pronounced for other salts with a higher constant of association.
CG models are thermodynamically state dependent.In particular, they are salt concentration dependent.Thus, developing a transferable salt solution model that would be accurate at any arbitrary salt concentration is impossible.Nevertheless, we still expect that the models will be transferable for small salt concentration changes.We investigated this aspect by training the model at one specific concentration and testing it at all four concentrations.The corresponding RDFs are shown in Supporting Information Figures S2−S5.We observe a relatively good reproduction of proposed to include the temperature dependency of the CG models. 63To emphasize the importance of capturing the many-body term of the PMF, a pure sodium chloride solution at 5.0 mol kg −1 has been performed following the computational detail of Shen et al., 83 the detailed results are provided in the Supporting Information (Section 4.3).The DIS model and two other implicit models, i.e., the IBI potential 76,128 fitted at 5.0 mol kg −1 and the transferable effective potential of the cited paper. 83The RDFs clearly highlight that addressing high-order atomistic correlations explicitly is essential for achieving a more accurate model.
The employed force-matching training strategy can theoretically reproduce all of the structural and thermodynamical properties of the underlying all-atom model.The dynamic properties, however, cannot be matched.CG models typically exhibit faster dynamics due to a smoother potential energy surface.Indeed, the average self-diffusion coefficient of ions, calculated via Einstein relation, for the all-atom and the DIS simulations 1.2 × 10 −9 and 6.8 × 10 −9 m 2 s −1 , respectively.

Periodic DNA in an Ionic Aqueous Solution.
Having validated the DIS model for the aqueous NaCl salt solution, we introduced a periodic DNA molecule into the simulation box.The main challenge here is to accurately describe the ion interactions in different chemical environments, i.e., near the DNA molecule and bulk media.In the bulk, atom density is low and only composed of ions.In the vicinity of the DNA molecule, the density of the particles significantly increases.To be able to capture the effective interaction of the ions around the DNA molecule, the complexity of the atomistic geometry representation is increased, i.e., the number of Bessel functions, the sizes of the 2-body network and the high-order tensor are increasing.The ML potential details are reported in Supporting Information.We developed a DIS model for two salt concentrations.In the following, we discuss the 1.0 mol L −1 salt solution.The results at 0.5 mol L −1 salt concentration, reported in the Supporting Information, show the same tendency.
First, we calculate the sodium and chloride cylindrical normalized density profiles (NDPs) from the center of mass of the DNA molecule (Figure 3).For Na + ions, we obtain excellent agreement with all-atom simulations.For Cl − ions, the density in the vicinity of the DNA molecule is slightly lower in the DIS model.Nonetheless, the structural properties are still significantly more accurate compared to the classical CG models employing explicit solvent modeling. 129,130This result demonstrates that the ML model can provide an accurate many-body potential of mean force even with the reduced DNA representation.Namely, the all-atom model contains 634 DNA atoms, while the input to the ML model is based on 120 explicit DNA atoms.
To visualize the ion distribution around the DNA molecule in 3D, we calculate the likelihood of observing Na + ions at every grid point around the DNA molecule using a grid spacing of 0.5 Å.Despite a few discrepancies, Figure 4 confirms that the DIS model accurately describes the ion distribution.We computed the Na + average occupancy in the first hydration shell and the associated residence times to further validate this point (Figure 5).The ion behavior in the DNA molecule's backbone, minor, and major grooves is similar regarding the average occupancy.To enable a comparison between the allatom and DIS models, we performed the calculation based on  the CG DNA representation 110 for both models.We observe a preferential binding of Na + ions to the phosphodiester group and major groove, while the minor groove exhibits low occupancy.Similar findings were also previously reported. 124n the other hand, the all-atom and DIS models display differences in the residence times.In particular, the residence times for the DIS model are lower by an approximately constant factor of 2.8 for the phosphodiester group and 2.6 for the major groove.For the Minor groove, the range of values is between 1 and 9. Due to the faster dynamics of CG models, lower residence times are expected. 129However, as expected, the DIS model outperforms the prior model regarding the average and the standard deviation of the residence times.Indeed, the prior model gives a good first approximation of the potential of mean force regarding the average occupancy, but the local ion configurations around DNA are not described.Yet, Figures 3 and 5 confirm that the ML potential enables to description of the stable configurations of the ions in the vicinity of DNA and correctly approximate the many-body terms of the PMF.
For a fair comparison of simulation speedup, we run simulations on one CPU using the all-atom and the DIS models for two systems: pure salt solution at 0.15 mol L −1 and the DNA molecule embedded into a salt solution at 0.5 mol L −1 .We use cutoff distances of 0.9 and 1.2 nm for computing the LJ and the electrostatic interactions, respectively, in both systems.For the pure salt solution, we obtain 0.5 and 84 ns per day for the all-atom model and the DIS MD, respectively.For the system containing the DNA molecule, we obtain 0.1 and 1.5 ns per day.The DIS model is competitive because of its much lower computational requirements and allows for simulating long trajectories as well as increasing the size of the system.

CONCLUSIONS
In conclusion, we present a new implicit solvation ML model of ionic media based on an equivariant graph neural network approach.Our DIS model showcased excellent accuracy with respect to the structural properties of aqueous salt solutions with concentrations ranging from 0.15 to 2.0 mol L −1 .The molecular system including a DNA molecule likewise confirmed that the model can simultaneously capture the effective ion interaction in two distinct environments: close to the DNA molecule and in bulk.This work paves the way for efficient simulations of ionic media by using an implicit solvation model.Our approach is general and could be used with other ML potentials.This will allow us to go beyond the current size of the system to understand the complex effective interaction of biomolecules in ionic media. 131We will also consider other salt solutions with higher valency ions to understand the impact on the learning process and the complexity of the neural network.
MD simulations of DNA require an accurate incorporation of electrostatics into the model.We have used the deltalearning approach to derive the ML interaction potential where the prior model captures explicit long-range interactions using Wolf summation.This represents the main limitation of our ML model since explicit long-range electrostatics are lacking in the current model.As this might lead to structural artifacts, DNA is fixed in this study.This allows us to capture only the behavior of the ions alone.Nevertheless, such a model could already be useful to study solid−liquid interfaces, 132,133 for example.Another limitation is the lack of transferability to other DNA sequences, as the model has been trained on only one specific DNA sequence and its corresponding fixed configuration.A cutoff-dependent Allegro potential could address this issue by using a large cutoff distance for ions and a shorter one for the DNA atoms and allow the introduction of a flexible DNA molecule in our model.This extension would enable the exploration of various CG mappings and their impact on the PMF in order to enhance the precision of the latter.An implementation of the prior model in Allegro's framework could also be an improvement.Our future work will focus on extending our model along these lines.
Parameters of the prior model; description of the Allegro potential; details of the hyperparameters and training protocols for the machine learning potential; coordination number calculated from the RDFs of the Figure 2; RDFs for all the aqueous NaCl solutions with a deep potential trained at a specific concentration; RDFs for all the aqueous NaCl solutions with a deep potential trained with all concentration configurations; structural analyses of ions for a DNA molecule in an aqueous NaCl salt solution of 0.5 mol L −1 ; and comparison of pairwise and DIS potentials for a highly concentrated NaCl aqueous solution (PDF) on the HPC facilities.Finally, the financial support of the Slovenian Research Agency (Grant no.P1-0002) is gratefully acknowledged.

Figure 1 .
Figure 1.Cross section of the simulation box for simulations of a periodic DNA molecule at 1.0 mol L −1 sodium chloride salt solution.The sodium ions are shown in green, while the chloride ions are in ochre.(a) All-atom model with explicit solvation.The carbon, nitrogen, phosphate, oxygen, and hydrogen atoms are depicted in cyan, blue, yellow, red, and white, respectively.(b) DIS model with implicit solvation.Two types of oxygen atoms are colored red and purple.Both are explicitly defined in the prior model, but only the positions of the red oxygen atoms are used as input to the ML potential.

Figure 4 .
Figure 4. 3D distribution of Na + cations around the DNA pitch in a NaCl aqueous solution at 1.0 mol L −1 for the all-atom (a), DIS (b), and the difference between both simulations (c).The distribution is normalized over the trajectory and the total number of cations.For the sake of clarity, only the positions with a sufficient probability are represented.The black beads represent the CG resolution of the DNA molecule.

Figure 5 .
Figure 5. Average occupancy (a) and residence times (b) of Na + ions in the first hydration shell of the atoms of DNA at 1.0 mol L −1 .The error bars denote the standard deviation.The fast fluctuations (<1 ps) are omitted in the residence time calculation.